Morphology Based Automatic Acquisition of Large-coverage Lexica
Abstract
In this article, we introduce a new technique for constructing wide-coverage morphological lexica from large corpora and morphological knowledge, with an application to French. It relies on the idea that the existence of a hypothetical lemma can be inferred when several different words found in the corpus are best interpreted as morphological variants of that lemma. We first validated the technique by extracting verbs and adjectives from a general French corpus of 25 million words. Compared with other lexical resources available for French, the results are very satisfying: we cover many words, often derived words, that are not always present in other lexica. Applying the algorithm to the acquisition of domain-specific adjectives from a botanical corpus also gave very good results, demonstrating its usability for extracting domain-specific lexica. Moreover, the technique can be generalized to any language with a substantial morphology. Part of the resulting lexicon (currently verbal forms) is already freely available at http://www.lefff.net/.
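The core idea, that a lemma is hypothesized when several distinct corpus words can be read as its inflected forms, can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the ending list `ER_ENDINGS`, the function `guess_er_lemmas`, and the `min_forms` threshold are all hypothetical, and a simple count of supporting forms stands in for whatever the paper uses to decide which interpretation is "best".

```python
from collections import defaultdict

# Hypothetical, heavily simplified paradigm: the infinitive plus a few
# present-tense endings of regular French -er verbs. This is an assumption
# for illustration, not the morphological description used in the paper.
ER_ENDINGS = {"e", "es", "ons", "ez", "ent", "er"}


def guess_er_lemmas(corpus_words, min_forms=3):
    """Return {lemma: sorted attested forms} for hypothetical -er lemmas
    supported by at least `min_forms` distinct corpus words."""
    support = defaultdict(set)
    for word in set(corpus_words):
        for ending in ER_ENDINGS:
            stem = word[: len(word) - len(ending)]
            # Require a non-trivial stem so very short words are ignored.
            if word.endswith(ending) and len(stem) >= 2:
                support[stem + "er"].add(word)
    # Keep only lemmas attested by enough different forms.
    return {
        lemma: sorted(forms)
        for lemma, forms in support.items()
        if len(forms) >= min_forms
    }


words = ["chante", "chantons", "chantez", "chanter", "table", "tables", "maison"]
print(guess_er_lemmas(words))
```

In this toy input, `chanter` is supported by four distinct forms and is kept, whereas `tabler` is supported only by `table` and `tables` and falls below the threshold. Requiring several different supporting forms is what keeps a single ambiguous word from producing a spurious lemma.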
Similar resources
Specifications of Building Polish Lexica for Application in ASR and TTS Systems
This paper brings detailed information concerning the specifications of building Polish lexica of common and special application words for use in speech applications such as ASR (automatic speech recognition) or TTS (text-to-speech) synthesis. The specifications include information on the collection of text corpora and word lists, phonetic, grammatical and morphological annotation, as well as s...
Evaluating and improving syntactic lexica by plugging them within a parser
We present some evaluation results for four French syntactic lexica, obtained through their conversion to the Alexina format used by the Lefff lexicon (Sagot, 2010), and their integration within the large-coverage TAG-based FRMG parser (de La Clergerie, 2005). The evaluations are run on two test corpora, annotated with two distinct annotation formats, namely EASy/Passage chunks and relations an...
ATOLL - A framework for the automatic induction of ontology lexica
There is a range of large knowledge bases, such as Freebase and DBpedia, as well as linked data sets available on the web, but they typically lack lexical information stating how the properties and classes they comprise are realized lexically. Often only one label is attached, if at all, thus lacking rich linguistic information, e.g. about morphological forms, syntactic arguments or possible le...
Spanish Lexical Acquisition via Morpho-Semantic Constructive Derivational Morphology
This paper describes an algorithm for Spanish derivational morphology whose output is generalizable to two different lexicon acquisition situations. One is the process of automatic lexicon acquisition via the use of Morpho-Semantic Lexical Rules (MSLRs) (Viegas, Gonzalez, & Longwell 1996), usable in semantically based Natural Language Processing (Nirenburg et al. 1996) in order to considerably r...
The Role of Morphology in Generating High-Quality Pronunciation Lexica for Regional Variants of Portuguese
Grapheme to phoneme (GTP) systems for languages such as English, German, and Korean have been shown to achieve better performance rates with the inclusion of a morpho-phonological preprocessing component. While semiautomatic and automatic GTP approaches for Portuguese continue to achieve steady gains, such algorithms do not take morphology into account, despite a growing need to do so, based in...